This notebook analyzes the number of disk reads required for different configurations of a hash table that uses cuckoo hashing for collision handling.
Disk reads are counted theoretically by assuming that each time a (key, value) pair is moved, the destination region of the data structure must be read from disk storage.
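The counting rule above can be sketched with a minimal two-table cuckoo insert. This is an illustrative sketch only: the hash functions are simple placeholders, not the xxHash/Murmur/CityHash variants used in the actual Monte Carlo runs.

```python
def cuckoo_insert(tables, key, max_kicks=50):
    """Insert `key` into a two-table cuckoo structure, counting one
    simulated disk read per slot touched.  `hash((t, key))` is a
    placeholder for the t-th hash function."""
    size = len(tables[0])
    reads = 0
    t = 0
    for _ in range(max_kicks):
        slot = hash((t, key)) % size
        reads += 1                      # touching a slot costs one read
        if tables[t][slot] is None:
            tables[t][slot] = key
            return reads
        # kick out the occupant; it must retry in the *other* table
        tables[t][slot], key = key, tables[t][slot]
        t = 1 - t
    raise RuntimeError("insertion failed; table needs a rehash")
```

Every displacement forces a probe into a different region of the structure, which is exactly the event the `NumReads` column tallies.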
import math
import numpy as np
import pandas as pd
from itertools import product
from functools import partial
from scipy.stats import poisson, geom, chisquare, chi2
import dask.dataframe as dd
from dask.distributed import Client, progress
client = Client(n_workers=2, threads_per_worker=2, memory_limit='4GB')
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.layouts import gridplot
from bokeh.resources import INLINE
pd.set_option('display.max_rows', 500)
Here we define the schema and load the data for the analysis.
filenames = [f"/mnt/mybook/MonteCarloCuckoo{i}.csv" for i in range(2, 5)]
data_types = {
"Unnamed: 0": "int64",
"Util": "float64",
"NumReads": "int64",
"NumTables": "float64",
"LenTables": "int64",
"NumElem": "int64",
"Hash1": "object",
"Hash2": "object",
"Hash3": "object",
"Hash4": "object"
}
df2 = dd.read_csv(filenames[0], dtype=data_types)
df3 = dd.read_csv(filenames[1], dtype=data_types)
df4 = dd.read_csv(filenames[2], dtype=data_types)
We want to understand some basic information about our dataset. This will be a brief analysis of the number of rows and their distribution.
# number of data points for 2 table variations
df2.shape[0].compute()
14914614
# number of data points for 3 table variations
df3.shape[0].compute()
20095628
# number of data points for 4 table variations
df4.shape[0].compute()
15142154
# max load factor for 2 table variations
df2["Util"].max().compute()
0.9261744966442952
# max load factor for 3 table variations
df3["Util"].max().compute()
0.9710982658959536
# max load factor for 4 table variations
df4["Util"].max().compute()
0.9692760089231394
Here we look at a couple of plots showing a 5% sample of our data in order to better understand the results. This informs the choice of a regression or other metric for empirically estimating the number of disk reads a hash table would require if it were not stored in main memory.
We see that the number of disk reads required appears to grow exponentially with the load factor (also called the utilization ratio here). Additionally, the growth rate of this hypothetical function differs with the number of tables. This follows from the theoretical likelihood of the number of operations required for insertion given the load factor and the number of tables.
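A crude back-of-the-envelope model (an assumption for intuition, not derived from this dataset): if each probe lands in an occupied slot independently with probability equal to the load factor α, the number of probes per insertion is geometric with mean 1/(1−α), which grows without bound as α → 1 and matches the explosive growth seen in the plots.

```python
import numpy as np

# Expected probes per insert under the independent-probe assumption.
alphas = np.linspace(0.1, 0.9, 9)
expected_probes = 1.0 / (1.0 - alphas)
for a, e in zip(alphas, expected_probes):
    print(f"load factor {a:.1f}: ~{e:.1f} probes per insert")
```

This model ignores the structure cuckoo hashing imposes on displacements, so it only motivates the shape of the curve, not its exact growth rate per table count.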
merged_df = dd.concat([df2.sample(frac=0.05), df3.sample(frac=0.05), df4.sample(frac=0.05)]).compute()
merged_df.head(10)
| | Unnamed: 0 | Util | NumReads | NumTables | LenTables | NumElem | Hash1 | Hash2 | Hash3 | Hash4 |
|---|---|---|---|---|---|---|---|---|---|---|
| 915994 | 14629 | 0.558440 | 2 | 2.0 | 13099 | 19118 | xxHash | Murmur | NaN | NaN |
| 479921 | 2247 | 0.083625 | 5 | 2.0 | 13441 | 19614 | xxHash | Murmur | NaN | NaN |
| 389704 | 126 | 0.000453 | 4 | 2.0 | 140111 | 19614 | xxHash | Murmur | NaN | NaN |
| 565048 | 16257 | 0.091144 | 5 | 2.0 | 89189 | 19614 | xxHash | Murmur | NaN | NaN |
| 340057 | 7163 | 0.354268 | 4 | 2.0 | 10111 | 19614 | xxHash | Murmur | NaN | NaN |
| 186688 | 10162 | 0.274631 | 1 | 2.0 | 18503 | 19614 | xxHash | Murmur | NaN | NaN |
| 702517 | 11450 | 0.041927 | 1 | 2.0 | 136559 | 19118 | xxHash | Murmur | NaN | NaN |
| 187601 | 11075 | 0.299303 | 2 | 2.0 | 18503 | 19614 | xxHash | Murmur | NaN | NaN |
| 228399 | 12645 | 0.393050 | 1 | 2.0 | 16087 | 19614 | xxHash | Murmur | NaN | NaN |
| 1011086 | 16584 | 0.008456 | 1 | 2.0 | 980711 | 19614 | xxHash | CityHash | NaN | NaN |
merged_df[["Util", "NumReads"]].plot(x="Util", y="NumReads", kind="scatter")
merged_df[merged_df["NumTables"] == 2][["Util", "NumReads"]].plot(x="Util", y="NumReads", kind="scatter")
merged_df[merged_df["NumTables"] == 3][["Util", "NumReads"]].plot(x="Util", y="NumReads", kind="scatter")
merged_df[merged_df["NumTables"] == 4][["Util", "NumReads"]].plot(x="Util", y="NumReads", kind="scatter")
If we think about how we might express a function for this data, a natural first step is a log transform in the hope of obtaining a linear regression. However, you will immediately notice that the log-transformed data does not follow a linear pattern, and even a regression fitted to it with some unknown functional form would merely minimize the error (most commonly by RSS). Such a function would still describe our data poorly, beyond giving us a notion of the expected case. Instead, since cuckoo hashing gives the data structure worst-case constant-time search complexity, we concern ourselves with a probabilistic worst-case complexity by finding the 90%, 95%, and 99% thresholds for insertion complexity. To find this probabilistic complexity we will look for a distribution that matches the underlying data.
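One way to compute such thresholds is to bin rows by load factor and take per-bin quantiles of `NumReads`. The bin count, quantile levels, and synthetic data below are illustrative assumptions; the real analysis would pass `merged_df` instead.

```python
import numpy as np
import pandas as pd

def read_quantiles(df, n_bins=10, qs=(0.90, 0.95, 0.99)):
    """Bin rows by load factor and report NumReads quantiles per bin."""
    out = df.copy()
    out["UtilBin"] = pd.cut(out["Util"], bins=n_bins)
    return (out.groupby("UtilBin", observed=True)["NumReads"]
               .quantile(list(qs))
               .unstack())

# tiny synthetic stand-in with the right qualitative shape
rng = np.random.default_rng(0)
util = rng.uniform(0, 0.9, 1000)
reads = rng.geometric(1 - util)          # reads blow up as util -> 1
demo = pd.DataFrame({"Util": util, "NumReads": reads})
table = read_quantiles(demo)
```

Each row of the result is a load-factor band; each column is a probabilistic worst-case read count at that confidence level.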
plot = figure(plot_width=400, plot_height=400)
plot.scatter(
x=merged_df[merged_df["NumTables"] == 2]["Util"],
y=np.log(merged_df[merged_df["NumTables"] == 2]["NumReads"])
)
show(plot)
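Given the `geom` and `chisquare` imports above, a natural first candidate is the geometric distribution. Below is a hedged sketch of a chi-square goodness-of-fit check; it runs on synthetic stand-in data, whereas the real analysis would slice `merged_df` to a narrow load-factor band and fixed table count.

```python
import numpy as np
from scipy.stats import geom, chisquare

rng = np.random.default_rng(1)
sample = rng.geometric(p=0.3, size=2000)        # stand-in for NumReads

p_hat = 1.0 / sample.mean()                     # MLE for the geometric p

# Bucket counts 1..9 plus one tail bucket so observed and expected
# totals match, as the chi-square test requires.
ks = np.arange(1, 10)
observed = np.array([(sample == k).sum() for k in ks])
observed = np.append(observed, (sample >= 10).sum())
expected = geom.pmf(ks, p_hat) * sample.size
expected = np.append(expected, geom.sf(9, p_hat) * sample.size)

# ddof=1 because one parameter (p) was estimated from the data.
stat, pvalue = chisquare(observed, expected, ddof=1)
```

A large p-value fails to reject the geometric model for that slice; repeating this across load-factor bands would show where (if anywhere) the model breaks down.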